Submitted by:
| # | Name | Id | Email |
|---|---|---|---|
| Student 1 | [your name here] | [your id here] | [your email here] |
| Student 2 | [your name here] | [your id here] | [your email here] |
In this assignment we'll create a from-scratch implementation of multilayer perceptrons, the core building block of deep neural networks. We'll visualize decision boundaries and ROC curves in the context of binary classification. Following that, we will focus on convolutional networks with residual blocks. We'll use PyTorch to create our own network architectures and train them using GPUs on the course servers, and we'll conduct architecture experiments to determine the effects of different architectural decisions on the performance of deep networks.
You can of course use any editor or IDE to work on these files.
In this part we'll implement a general-purpose MLP and binary classifier using PyTorch.
We'll implement its training, and also learn about decision boundaries and threshold selection in the context of binary classification. Finally, we'll explore the effect of depth and width on an MLP's performance.
import os
import re
import sys
import glob
import unittest
from typing import Sequence, Tuple
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as tvtf
from torch import Tensor
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
To test our first neural network-based classifiers we'll start by creating a toy binary classification dataset, but one which is not trivial for a linear model.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
def rotate_2d(X, deg=0):
"""
Rotates each 2d sample in X of shape (N, 2) by deg degrees.
"""
a = np.deg2rad(deg)
return X @ np.array([[np.cos(a), -np.sin(a)],[np.sin(a), np.cos(a)]]).T
def plot_dataset_2d(X, y, n_classes=2, alpha=0.2, figsize=(8, 6), title=None, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=figsize)
for c in range(n_classes):
ax.scatter(*X[y==c,:].T, alpha=alpha, label=f"class {c}");
ax.set_xlabel("$x_1$"); ax.set_ylabel("$x_2$");
ax.legend(); ax.set_title((title or '') + f" (n={len(y)})")
We'll split our data into 80% train and validation, and 20% test. To make it a bit more challenging, we'll simulate a somewhat real-world setting where there are multiple populations, and the training/validation data is not sampled iid from the underlying data distribution.
np.random.seed(seed)
N = 10_000
N_train = int(N * .8)
# Create data from two different distributions for the training/validation
X1, y1 = make_moons(n_samples=N_train//2, noise=0.2)
X1 = rotate_2d(X1, deg=10)
X2, y2 = make_moons(n_samples=N_train//2, noise=0.25)
X2 = rotate_2d(X2, deg=50)
# Test data comes from a similar but noisier distribution
X3, y3 = make_moons(n_samples=(N-N_train), noise=0.3)
X3 = rotate_2d(X3, deg=40)
X, y = np.vstack([X1, X2, X3]), np.hstack([y1, y2, y3])
# Train and validation data is from mixture distribution
X_train, X_valid, y_train, y_valid = train_test_split(X[:N_train, :], y[:N_train], test_size=1/3, shuffle=False)
# Test data is only from the third (noisier) distribution
X_test, y_test = X[N_train:, :], y[N_train:]
fig, ax = plt.subplots(1, 3, figsize=(20, 5))
plot_dataset_2d(X_train, y_train, title='Train', ax=ax[0]);
plot_dataset_2d(X_valid, y_valid, title='Validation', ax=ax[1]);
plot_dataset_2d(X_test, y_test, title='Test', ax=ax[2]);
Now let us create a data loader for each dataset.
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
batch_size = 32
dl_train, dl_valid, dl_test = [
DataLoader(
dataset=TensorDataset(
torch.from_numpy(X_).to(torch.float32),
torch.from_numpy(y_)
),
shuffle=True,
num_workers=0,
batch_size=batch_size
)
for X_, y_ in [(X_train, y_train), (X_valid, y_valid), (X_test, y_test)]
]
print(f'{len(dl_train.dataset)=}, {len(dl_valid.dataset)=}, {len(dl_test.dataset)=}')
len(dl_train.dataset)=5333, len(dl_valid.dataset)=2667, len(dl_test.dataset)=2000
A multilayer perceptron (MLP) is arguably the most basic type of neural network model. It is composed of $L$ layers, each layer $l$ with $n_l$ perceptron ("neuron") units. Each perceptron is connected to all outputs of the previous layer (or to all inputs, in the first layer), calculates their weighted sum, applies a non-linearity and produces a single output.

Each layer $l$ operates on the output of the previous layer ($\vec{y}_{l-1}$) and calculates:
$$ \vec{y}_l = \varphi\left( \mat{W}_l \vec{y}_{l-1} + \vec{b}_l \right),~ \mat{W}_l\in\set{R}^{n_{l}\times n_{l-1}},~ \vec{b}_l\in\set{R}^{n_l},~ l \in \{1,2,\dots,L\}. $$

To begin, let's implement a general multi-layer perceptron model. We'll seek to implement it in a way which is both general in terms of architecture, and also composable, so that we can use our MLP in the context of larger models.
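The per-layer recurrence above can be sketched directly with tensors. This is just a toy forward pass illustrating the equation (with $\varphi$ = ReLU throughout, and arbitrary layer sizes), not the required MLP class:

```python
import torch

torch.manual_seed(0)
dims = [2, 8, 16]            # n_0 (input dim), n_1, n_2
y = torch.randn(5, dims[0])  # a batch of 5 two-dimensional samples

# y_l = phi(W_l y_{l-1} + b_l), here with phi = ReLU for every layer
for n_in, n_out in zip(dims[:-1], dims[1:]):
    W = torch.randn(n_out, n_in) * 0.1  # W_l in R^{n_l x n_{l-1}}
    b = torch.zeros(n_out)              # b_l in R^{n_l}
    y = torch.relu(y @ W.T + b)

print(y.shape)  # → torch.Size([5, 16])
```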
TODO: Implement the MLP class in the hw2/mlp.py module.
from hw2.mlp import MLP
mlp = MLP(
in_dim=2,
dims=[8, 16, 32, 64],
nonlins=['relu', 'tanh', nn.LeakyReLU(0.314), 'softmax']
)
mlp
MLP(
(fc_layers): Sequential(
(0): Linear(in_features=2, out_features=8, bias=True)
(1): ReLU()
(2): Linear(in_features=8, out_features=16, bias=True)
(3): Tanh()
(4): Linear(in_features=16, out_features=32, bias=True)
(5): LeakyReLU(negative_slope=0.314)
(6): Linear(in_features=32, out_features=64, bias=True)
(7): Softmax(dim=1)
)
)
Let's try our implementation on a batch of data.
x0, y0 = next(iter(dl_train))
yhat0 = mlp(x0)
test.assertEqual(len([*mlp.parameters()]), 8)
test.assertEqual(yhat0.shape, (batch_size, mlp.out_dim))
test.assertTrue(torch.allclose(torch.sum(yhat0, dim=1), torch.tensor(1.0)))
test.assertIsNotNone(yhat0.grad_fn)
yhat0
tensor([[0.0126, 0.0169, 0.0145, ..., 0.0189, 0.0193, 0.0185],
[0.0122, 0.0162, 0.0143, ..., 0.0187, 0.0176, 0.0169],
[0.0124, 0.0165, 0.0144, ..., 0.0189, 0.0184, 0.0173],
...,
[0.0125, 0.0165, 0.0141, ..., 0.0189, 0.0177, 0.0182],
[0.0123, 0.0162, 0.0142, ..., 0.0186, 0.0175, 0.0171],
[0.0127, 0.0169, 0.0145, ..., 0.0189, 0.0192, 0.0186]],
grad_fn=<SoftmaxBackward0>)
The MLP model we've implemented, while useful, is very general. For the task of binary classification, we would like to add some additional functionality to it: the ability to output a normalized score for a sample being in class one (which we interpret as a probability) and a prediction based on some threshold of this probability. In addition, we need some way to calculate a meaningful threshold based on the data and a trained model at hand.
In order to maintain generality, we'll add this functionality in the form of a wrapper: a BinaryClassifier class that can wrap any model producing two output features, and provide the functionality stated above.
TODO: In the hw2/classifier.py module, implement the BinaryClassifier and the missing parts of its base class, Classifier. Read the method documentation carefully and implement accordingly.
You can ignore the select_roc_thresh function at this stage.
from hw2.classifier import BinaryClassifier
bmlp4 = BinaryClassifier(
model=MLP(in_dim=2, dims=[*[10]*3, 2], nonlins=[*['relu']*3, 'none']),
threshold=0.5
)
print(bmlp4)
# Test model
test.assertEqual(len([*bmlp4.parameters()]), 8)
test.assertIsNotNone(bmlp4(x0).grad_fn)
# Test forward
yhat0_scores = bmlp4(x0)
test.assertEqual(yhat0_scores.shape, (batch_size, 2))
test.assertFalse(torch.allclose(torch.sum(yhat0_scores, dim=1), torch.tensor(1.0)))
# Test predict_proba
yhat0_proba = bmlp4.predict_proba(x0)
test.assertEqual(yhat0_proba.shape, (batch_size, 2))
test.assertTrue(torch.allclose(torch.sum(yhat0_proba, dim=1), torch.tensor(1.0)))
# Test classify
yhat0 = bmlp4.classify(x0)
test.assertEqual(yhat0.shape, (batch_size,))
test.assertEqual(yhat0.dtype, torch.int)
test.assertTrue(all(yh_ in (0, 1) for yh_ in yhat0))
BinaryClassifier(
(model): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=2, out_features=10, bias=True)
(1): ReLU()
(2): Linear(in_features=10, out_features=10, bias=True)
(3): ReLU()
(4): Linear(in_features=10, out_features=10, bias=True)
(5): ReLU()
(6): Linear(in_features=10, out_features=2, bias=True)
(7): Identity()
)
)
)
Now that we have a classifier, we need to train it.
We will abstract the various aspects of training, such as multiple epochs, iterating over batches, early stopping and saving model checkpoints, into a Trainer class that will take care of these concerns.
The Trainer class splits the task of training (and evaluating) models into three conceptual levels:

- The fit method, which returns a FitResult containing losses and accuracies for all epochs.
- The train_epoch and test_epoch methods, which return an EpochResult containing losses per batch and the single accuracy result of the epoch.
- The train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.

The Trainer implements the first two levels. Inheriting classes are expected to implement the single-batch level methods, since these are model- and/or task-specific.
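The three-level structure can be sketched as follows. This is a schematic skeleton only (the class and method names below mirror the description; the real classes live in hw2/training.py and also return FitResult/BatchResult objects, handle devices, printing, etc.):

```python
from typing import List, NamedTuple

class EpochResult(NamedTuple):
    losses: List[float]
    accuracy: float

class TrainerSketch:
    def fit(self, dl_train, dl_valid, num_epochs):
        # Level 1: loop over epochs, collecting per-epoch results
        train_acc, valid_acc = [], []
        for _ in range(num_epochs):
            train_acc.append(self.train_epoch(dl_train).accuracy)
            valid_acc.append(self.test_epoch(dl_valid).accuracy)
        return train_acc, valid_acc

    def train_epoch(self, dl):
        # Level 2: loop over batches, aggregating losses and accuracy
        losses, correct, total = [], 0, 0
        for X, y in dl:
            loss, num_correct = self.train_batch((X, y))
            losses.append(loss)
            correct, total = correct + num_correct, total + len(y)
        return EpochResult(losses, accuracy=100.0 * correct / total)

    test_epoch = train_epoch  # same loop; the real class calls test_batch instead

    def train_batch(self, batch):
        # Level 3: model/task specific; trivial stand-in here
        X, y = batch
        return 0.0, len(y)  # zero loss, all samples "correct"

dl = [(list(range(4)), list(range(4)))] * 3  # dummy "data loader" of 3 batches
print(TrainerSketch().fit(dl, dl, num_epochs=2))
# → ([100.0, 100.0], [100.0, 100.0])
```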
TODO:
Implement the Trainer's fit method and the ClassifierTrainer's train_batch/test_batch methods, in the hw2/training.py module. You may ignore the optional parts about early stopping and model checkpoints at this stage.
Set the model's architecture hyper-parameters and the optimizer hyperparameters in part1_arch_hp() and part1_optim_hp(), respectively, in hw2/answers.py.
Since this is a toy dataset, you should be able to quickly get above 85% accuracy even on the test set.
from hw2.training import ClassifierTrainer
from hw2.answers import part1_arch_hp, part1_optim_hp
torch.manual_seed(seed)
hp_arch = part1_arch_hp()
hp_optim = part1_optim_hp()
model = BinaryClassifier(
model=MLP(
in_dim=2,
dims=[*[hp_arch['hidden_dims'],]*hp_arch['n_layers'], 2],
nonlins=[*[hp_arch['activation'],]*hp_arch['n_layers'], hp_arch['out_activation']]
),
threshold=0.5,
)
print(model)
loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)
trainer = ClassifierTrainer(model, loss_fn, optimizer)
fit_result = trainer.fit(dl_train, dl_valid, num_epochs=20, print_every=10);
test.assertGreaterEqual(fit_result.train_acc[-1], 85.0)
test.assertGreaterEqual(fit_result.test_acc[-1], 75.0)
BinaryClassifier(
(model): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=2, out_features=10, bias=True)
(1): ReLU()
(2): Linear(in_features=10, out_features=10, bias=True)
(3): ReLU()
(4): Linear(in_features=10, out_features=10, bias=True)
(5): ReLU()
(6): Linear(in_features=10, out_features=2, bias=True)
(7): Identity()
)
)
)
--- EPOCH 1/20 ---
--- EPOCH 11/20 ---
--- EPOCH 20/20 ---
from cs236781.plot import plot_fit
plot_fit(fit_result, log_loss=False, train_test_overlay=True);
An important part of understanding what a non-linear classifier like our MLP is doing is visualizing its decision boundaries. When we only have two input features, these are relatively simple to visualize: we can simply plot our data on the plane and evaluate our classifier on a constant 2D grid in order to approximate the decision boundary.
TODO: Implement the plot_decision_boundary_2d function in the hw2/classifier.py module.
from hw2.classifier import plot_decision_boundary_2d
fig, ax = plot_decision_boundary_2d(model, *dl_valid.dataset.tensors)
Another important component, especially in the context of binary classification, is threshold selection. Until now, we arbitrarily chose a threshold of 0.5 when deciding the class label based on the probability score we calculated via softmax. In other words, we classified a sample as class 1 (the 'positive' class) when its probability score was greater than or equal to 0.5.
However, in real-world classification problems we'll need to choose our threshold wisely, based on the domain-specific requirements of the problem. For example, depending on our application, we might care more about high sensitivity (correctly classifying positive examples), while for other applications specificity (correctly classifying negative examples) is more important.
One way to understand the mistakes a model is making is to look at its Confusion Matrix. From it, we easily see e.g. the false-negative rate (FNR) and false-positive rate (FPR).
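The FPR and FNR can be read off a binary confusion matrix directly. A small sketch with toy labels and predictions (not from our trained model), just to illustrate:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, just to illustrate reading the rates off the matrix
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)  # false-positive rate
fnr = fn / (fn + tp)  # false-negative rate
print(fpr, fnr)  # → 0.5 0.25
```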
Let's look at the confusion matrices on the test and validation data using the model we trained above.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
def plot_confusion(classifier, x: np.ndarray, y: np.ndarray, ax=None):
y_hat = classifier.classify(torch.from_numpy(x).to(torch.float32)).numpy()
conf_mat = confusion_matrix(y, y_hat, normalize='all')
ConfusionMatrixDisplay(conf_mat).plot(ax=ax, colorbar=False)
model.threshold = 0.5
_, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].set_title("Train"); axes[1].set_title("Validation");
plot_confusion(model, X_train, y_train, ax=axes[0])
plot_confusion(model, X_valid, y_valid, ax=axes[1])
We can see that the model makes a different number of false-positive and false-negative errors. Clearly, this proportion would change if the classification threshold were different.
A very common way to select the classification threshold is to find a threshold which optimally balances between the FPR and FNR.
This can be done by plotting the model's ROC curve, which shows 1-FNR (i.e. the TPR) vs. FPR for multiple threshold values, and selecting the point closest to the ideal point (0, 1).
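The selection rule can be sketched with sklearn's roc_curve, here on hypothetical scores and labels (select_roc_thresh would apply the same idea to the model's predicted positive-class probabilities on a real dataset):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and positive-class scores, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Choose the threshold whose (FPR, TPR) point is closest to the ideal (0, 1).
# Note that tpr = 1 - FNR.
dist = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
best_thresh = thresholds[np.argmin(dist)]
print(best_thresh)  # → 0.65
```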
TODO: Implement the select_roc_thresh function in the hw2.classifier module.
from hw2.classifier import select_roc_thresh
optimal_thresh = select_roc_thresh(model, *dl_valid.dataset.tensors, plot=True)
Let's see the effect of our threshold selection on the confusion matrix and decision boundary.
model.threshold = optimal_thresh
_, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].set_title("Train"); axes[1].set_title("Validation");
plot_confusion(model, X_train, y_train, ax=axes[0])
plot_confusion(model, X_valid, y_valid, ax=axes[1])
fig, ax = plot_decision_boundary_2d(model, *dl_valid.dataset.tensors)
Now, equipped with the tools we've implemented so far, we'll experiment with various MLP architectures. We'll study the effect of the model's depth (number of hidden layers) and width (number of neurons per hidden layer) on its decision boundaries and the resulting performance. After training, we will use the validation set for threshold selection, and seek to maximize the performance on the test set.
TODO: Implement the mlp_experiment function in hw2/experiments.py.
You are free to configure any model and optimization hyperparameters however you like, except for the specified width and depth.
Experiment with various options for these other hyperparameters and try to obtain the best results you can.
from itertools import product
from tqdm.auto import tqdm
from hw2.experiments import mlp_experiment
torch.manual_seed(seed)
depths = [1, 2, 4]
widths = [2, 8, 32, 128]
exp_configs = product(enumerate(widths), enumerate(depths))
fig, axes = plt.subplots(len(widths), len(depths), figsize=(10*len(depths), 10*len(widths)), squeeze=False)
test_accs = []
for (i, width), (j, depth) in tqdm(list(exp_configs)):
model, thresh, valid_acc, test_acc = mlp_experiment(
depth, width, dl_train, dl_valid, dl_test, n_epochs=10
)
test_accs.append(test_acc)
fig, ax = plot_decision_boundary_2d(model, *dl_test.dataset.tensors, ax=axes[i, j])
ax.set_title(f"{depth=}, {width=}")
ax.text(ax.get_xlim()[0]*.95, ax.get_ylim()[1]*.95, f"{thresh=:.2f}\n{valid_acc=:.1f}%\n{test_acc=:.1f}%", va="top")
# Assert minimal performance requirements.
# You should be able to do better than these by at least 5%.
test.assertGreaterEqual(np.min(test_accs), 75.0)
test.assertGreaterEqual(np.quantile(test_accs, 0.75), 85.0)
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Consider the first binary classifier you trained in this notebook and the loss/accuracy curves we plotted for it on the train and validation sets, as well as the decision boundary plot.
Based on those plots, explain qualitatively whether or not your model has:
Explain your answers for each of the above. Since this is a qualitative question, assume "high" simply means "I would take measures in order to decrease it further".
display_answer(hw2.answers.part1_q1)
Your answer:
Consider the first binary classifier you trained in this notebook and the confusion matrices we plotted for it.
For the validation dataset, would you expect the FPR or the FNR to be higher, and why? Recall that you have full knowledge of the data generating process.
display_answer(hw2.answers.part1_q2)
Your answer: We expect the FNR to be higher. When generating the data, the noise magnitude of the validation data is larger than that of the training data. The model's task is to predict whether a sample's label is 1 (positive) or not, so the larger noise will cause the model to output fewer positive predictions, which means the FNR will be larger.
You're training a binary classifier to screen a large cohort of patients for some disease, with the aim of detecting the disease early, before any symptoms appear. You train the model on easy-to-obtain features, so screening each individual patient is simple and low-cost. In case the model classifies a patient as sick, she must then be sent for further testing in order to confirm the illness. Assume that these further tests are expensive and involve high risk to the patient. Assume also that once diagnosed, a low-cost treatment exists.
You wish to screen as many people as possible at the lowest possible cost and loss of life. Would you still choose the same "optimal" point on the ROC curve as above? If not, how would you choose it? Answer these questions for two possible scenarios:
Explain your answers.
display_answer(hw2.answers.part1_q3)
Your answer:
In this case, the main focus is to decrease the cost of further testing after a 'positive' prediction of our model, so the FPR should be as small as possible. Meanwhile, because the disease develops obvious non-lethal symptoms and can then be well treated at a low cost, it matters less if a patient with the disease is classified as "negative", so the FNR can be high. Therefore, the bottom-left point of the ROC curve can be chosen.
In this case, the FNR must be minimal in order to decrease the loss of life, and the FPR should be small to decrease the testing cost, so the top-left point of the ROC curve should be chosen.
Analyze your results from the Architecture Experiment.
1. The effect of width on the accuracy (fixed depth, width varies).
2. The effect of depth on the accuracy (fixed width, depth varies).
3. A comparison of the configurations depth=1, width=32 and depth=4, width=8.
4. A comparison of the configurations depth=1, width=128 and depth=4, width=32.

display_answer(hw2.answers.part1_q4)
Your answer:
In this part we will explore convolutional networks. We'll implement a common block-based deep CNN pattern, with and without residual connections.
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Convolutional layers are the most essential building blocks of state-of-the-art deep learning image classification models, and also play an important role in many other tasks. As we saw in the tutorial, when applied to images, convolutional layers operate on and produce volumes (3D tensors) of activations.
A convenient way to interpret convolutional layers for images is as a collection of 3D learnable filters, each of which operates on a small spatial region of the input volume. Each filter is convolved with the input volume ("slides over it"), and a dot product is computed at each location followed by a non-linearity which produces one activation. All these activations produce a 2D plane known as a feature map. Multiple feature maps (one for each filter) comprise the output volume.

A crucial property of convolutional layers is their translation equivariance, i.e. shifting the input results in an equivalently shifted output. This produces the ability to detect features regardless of their spatial location in the input.
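Translation equivariance can be verified numerically with a small sketch: away from the image borders (where zero-padding and the wrap-around of torch.roll differ), the response of a convolution to a shifted input equals the shifted response to the original input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 8, 8)
x_shifted = torch.roll(x, shifts=2, dims=3)  # shift the input 2 pixels right

with torch.no_grad():
    y, y_shifted = conv(x), conv(x_shifted)

# Compare interior regions: output cols 3..6 of the shifted input
# match output cols 1..4 of the original input (shifted by 2)
print(torch.allclose(y_shifted[..., 1:7, 3:7], y[..., 1:7, 1:5], atol=1e-6))
# → True
```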
Convolutional network architectures usually follow a pattern of basic repeating blocks: one or more convolutional layers, each followed by a non-linearity (generally ReLU), and then a pooling layer to reduce the spatial dimensions. Usually, the number of convolutional filters increases the deeper they are in the network. These layers are meant to extract features from the input. Then, one or more fully-connected layers are used to combine the extracted features into the required number of output class scores.
PyTorch provides all the basic building blocks needed for creating a convolutional architecture within the torch.nn package.
Let's use them to create a basic convolutional network with the following architecture pattern:
[(CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
Here $N$ is the total number of convolutional layers, $P$ specifies how many convolutions to perform before each pooling layer and $M$ specifies the number of hidden fully-connected layers before the final output layer.
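As a concrete reference point, here is a hand-built instance of the pattern for N=4, P=2, M=1 on 3x32x32 inputs, i.e. a sketch of the kind of model the CNN class should generate (the specific channel counts and kernel sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

# [(CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC, with N=4, P=2, M=1
net = nn.Sequential(
    # First group of P=2 convolutions, then a pool
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    # Second group of P=2 convolutions, then a pool
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    # (FC -> ACT)*M -> FC
    nn.Linear(32 * 8 * 8, 100), nn.ReLU(),  # two pools: 32x32 -> 8x8
    nn.Linear(100, 10),
)
print(net(torch.randn(1, 3, 32, 32)).shape)  # → torch.Size([1, 10])
```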
TODO: Complete the implementation of the CNN class in the hw2/cnn.py module.
Use PyTorch's nn.Conv2d and nn.MaxPool2d for the convolution and pooling layers.
It's recommended to implement the missing functionality in the order of the class' methods.
from hw2.cnn import CNN
test_params = [
dict(
in_size=(3,100,100), out_classes=10,
channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
conv_params=dict(kernel_size=3, stride=1, padding=1),
activation_type='relu', activation_params=dict(),
pooling_type='max', pooling_params=dict(kernel_size=2),
),
dict(
in_size=(3,100,100), out_classes=10,
channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
conv_params=dict(kernel_size=5, stride=2, padding=3),
activation_type='lrelu', activation_params=dict(negative_slope=0.05),
pooling_type='avg', pooling_params=dict(kernel_size=3),
),
dict(
in_size=(3,100,100), out_classes=3,
channels=[16]*5, pool_every=3, hidden_dims=[100]*1,
conv_params=dict(kernel_size=2, stride=2, padding=2),
activation_type='lrelu', activation_params=dict(negative_slope=0.1),
pooling_type='max', pooling_params=dict(kernel_size=2),
),
]
for i, params in enumerate(test_params):
torch.manual_seed(seed)
net = CNN(**params)
print(f"\n=== test {i=} ===")
print(net)
torch.manual_seed(seed)
test_out = net(torch.ones(1, 3, 100, 100))
print(f'{test_out=}')
expected_out = torch.load(f'tests/assets/expected_conv_out_{i:02d}.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
(2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU()
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU()
(7): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU()
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=20000, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=100, bias=True)
(3): ReLU()
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0745, -0.1058, 0.0928, 0.0476, 0.0057, 0.0051, 0.0938, -0.0582,
0.0573, 0.0583]], grad_fn=<AddmmBackward0>)
max_diff=1.1175870895385742e-08
=== test i=1 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(1): LeakyReLU(negative_slope=0.05)
(2): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(3): LeakyReLU(negative_slope=0.05)
(4): AvgPool2d(kernel_size=3, stride=3, padding=0)
(5): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(6): LeakyReLU(negative_slope=0.05)
(7): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(8): LeakyReLU(negative_slope=0.05)
(9): AvgPool2d(kernel_size=3, stride=3, padding=0)
)
(mlp): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=32, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.05)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.05)
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0724, -0.0030, 0.0637, -0.0073, 0.0932, -0.0662, -0.0656, 0.0076,
0.0193, 0.0241]], grad_fn=<AddmmBackward0>)
max_diff=1.4901161193847656e-08
=== test i=2 ===
CNN(
(feature_extractor): Sequential(
(0): Conv2d(3, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(1): LeakyReLU(negative_slope=0.1)
(2): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(3): LeakyReLU(negative_slope=0.1)
(4): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(5): LeakyReLU(negative_slope=0.1)
(6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(7): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(8): LeakyReLU(negative_slope=0.1)
(9): Conv2d(16, 16, kernel_size=(2, 2), stride=(2, 2), padding=(2, 2))
(10): LeakyReLU(negative_slope=0.1)
)
(mlp): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=400, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.1)
(2): Linear(in_features=100, out_features=3, bias=True)
(3): Identity()
)
)
)
test_out=tensor([[-0.0004, -0.0094, 0.0817]], grad_fn=<AddmmBackward0>)
max_diff=1.862645149230957e-09
As before, we'll wrap our model with a Classifier that provides the necessary functionality for calculating probability scores and obtaining class label predictions.
This time, we'll use a simple approach: select the class with the highest score.
TODO: Implement the ArgMaxClassifier in the hw2/classifier.py module.
from hw2.classifier import ArgMaxClassifier
model = ArgMaxClassifier(model=CNN(**test_params[0]))
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test.assertEqual(model.classify(test_image).shape, (1,))
test.assertEqual(model.predict_proba(test_image).shape, (1, 10))
test.assertAlmostEqual(torch.sum(model.predict_proba(test_image)).item(), 1.0, delta=1e-3)
Let's now load CIFAR-10 to use as our dataset.
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
x0,_ = ds_train[0]
in_size = x0.shape
num_classes = 10
print('input image size =', in_size)
Files already downloaded and verified
Files already downloaded and verified
Train: 50000 samples
Test: 10000 samples
input image size = torch.Size([3, 32, 32])
Now, as usual, as a sanity test, let's make sure we can overfit a tiny dataset with our model. But first we need to adapt our Trainer for PyTorch models.
TODO:
- Implement the ClassifierTrainer class in the hw2/training.py module, if you haven't done so already.
- Set the optimizer hyperparameters in part2_optim_hp() in hw2/answers.py.

from hw2.training import ClassifierTrainer
from hw2.answers import part2_optim_hp
torch.manual_seed(seed)
# Define a tiny part of the CIFAR-10 dataset to overfit it
batch_size = 2
max_batches = 25
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Create model, loss and optimizer instances
model = ArgMaxClassifier(
model=CNN(
in_size, num_classes, channels=[32], pool_every=1, hidden_dims=[100],
conv_params=dict(kernel_size=3, stride=1, padding=1),
pooling_params=dict(kernel_size=2),
)
)
hp_optim = part2_optim_hp()
loss_fn = hp_optim.pop('loss_fn')
optimizer = torch.optim.SGD(params=model.parameters(), **hp_optim)
# Use ClassifierTrainer to run only the training loop a few times.
trainer = ClassifierTrainer(model, loss_fn, optimizer, device)
best_acc = 0
for i in range(25):
res = trainer.train_epoch(dl_train, max_batches=max_batches, verbose=(i%5==0))
best_acc = res.accuracy if res.accuracy > best_acc else best_acc
# Test overfitting
test.assertGreaterEqual(best_acc, 90)
A very common addition to the basic convolutional architecture described above is shortcut connections. First proposed by He et al. (2016), this simple addition has been shown to be a crucial ingredient for achieving effective learning with very deep networks. Virtually all state-of-the-art image classification models from recent years use this technique.
The idea is to add a shortcut, or skip connection, around every two or more convolutional layers:

This gives the network an easy way to learn an identity mapping: simply set the weight values to be very small. The consequence is that the convolutional layers learn a residual mapping, i.e. some delta that is applied to the identity map, instead of having to learn a completely new mapping from scratch.
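A minimal sketch of the residual computation, out = F(x) + x, makes the identity-mapping argument concrete: with (near-)zero convolution weights, the block reduces to the identity.

```python
import torch
import torch.nn as nn

# Minimal residual computation: out = F(x) + x (shapes assumed to match)
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)
nn.init.zeros_(conv.weight)
nn.init.zeros_(conv.bias)

x = torch.randn(1, 8, 16, 16)
out = conv(x) + x  # with zero weights, F(x) = 0 and the block is the identity
print(torch.equal(out, x))  # → True
```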
Let's start by implementing a general residual block, representing a structure similar to the above diagrams. Our residual block will be composed of:

- A main path containing the convolutional layers, each of them (except the last) followed by optional dropout, optional batch normalization, and an activation.
- A shortcut path, which is the identity mapping when the input and output channels match, and otherwise applies a (bias-free) 1x1 convolution to project the channel dimension.

TODO: Complete the implementation of the ResidualBlock's __init__() method in the hw2/cnn.py module.
from hw2.cnn import ResidualBlock
torch.manual_seed(seed)
resblock = ResidualBlock(
in_channels=3, channels=[6, 4]*2, kernel_sizes=[3, 5]*2,
batchnorm=True, dropout=0.2
)
print(resblock)
torch.manual_seed(seed)
test_out = resblock(torch.ones(1, 3, 32, 32))
print(f'{test_out.shape=}')
expected_out = torch.load('tests/assets/expected_resblock_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.2, inplace=False)
(2): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(5): Dropout2d(p=0.2, inplace=False)
(6): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): ReLU()
(8): Conv2d(4, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.2, inplace=False)
(10): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
test_out.shape=torch.Size([1, 4, 32, 32])
max_diff=0.0
In the ResNet Block diagram shown above, the right block is called a bottleneck block. This type of block is mainly used deep in the network, where the feature space becomes increasingly high-dimensional (i.e. there are many channels).
Instead of applying a KxK conv layer on the original input channels, a bottleneck block first projects to a lower number of features (channels), applies the KxK conv on the result, and then projects back to the original feature space. Both projections are performed with 1x1 convolutions.
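The savings are substantial: a direct 3x3 conv on 256 channels costs 256·256·9 ≈ 590K weights, while projecting down to 64 channels first costs roughly 256·64 + 64·64·9 + 64·256 ≈ 70K. A sketch of such a main path (the specific channel counts are illustrative, not the required ResidualBottleneckBlock API):

```python
import torch
import torch.nn as nn

# Bottleneck main path: 1x1 projection 256 -> 64, a 3x3 conv on the
# narrow representation, then a 1x1 projection back up to 256 channels
main_path = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),
)
x = torch.randn(1, 256, 16, 16)
out = main_path(x) + x  # identity shortcut, since in/out channels match
print(out.shape)  # → torch.Size([1, 256, 16, 16])
```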
TODO: Complete the implementation of the ResidualBottleneckBlock in the hw2/cnn.py module.
from hw2.cnn import ResidualBottleneckBlock
torch.manual_seed(seed)
resblock_bn = ResidualBottleneckBlock(
in_out_channels=256, inner_channels=[64, 32, 64], inner_kernel_sizes=[3, 5, 3],
batchnorm=False, dropout=0.1, activation_type="lrelu"
)
print(resblock_bn)
# Test a forward pass
torch.manual_seed(seed)
test_in = torch.ones(1, 256, 32, 32)
test_out = resblock_bn(test_in)
print(f'{test_out.shape=}')
assert test_out.shape == test_in.shape
expected_out = torch.load('tests/assets/expected_resblock_bn_out.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): LeakyReLU(negative_slope=0.01)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Dropout2d(p=0.1, inplace=False)
(5): LeakyReLU(negative_slope=0.01)
(6): Conv2d(64, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(7): Dropout2d(p=0.1, inplace=False)
(8): LeakyReLU(negative_slope=0.01)
(9): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): Dropout2d(p=0.1, inplace=False)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential()
)
test_out.shape=torch.Size([1, 256, 32, 32])
max_diff=0.0
Now, based on the ResidualBlock, we'll implement our own variation of a residual network (ResNet),
with the following architecture:
[-> (CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
\------- SKIP ------/
Note that $N$, $P$ and $M$ are as before, however now $P$ also controls the number of convolutional layers to add a skip-connection to.
TODO: Complete the implementation of the ResNet class in the hw2/cnn.py module.
You must use your ResidualBlocks or ResidualBottleneckBlocks to group together every $P$ convolutional layers.
from hw2.cnn import ResNet
test_params = [
dict(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
activation_type='lrelu', activation_params=dict(negative_slope=0.01),
pooling_type='avg', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
bottleneck=False
),
dict(
# create 64->16->64 bottlenecks
in_size=(3,100,100), out_classes=5, channels=[64, 16, 64]*4,
pool_every=3, hidden_dims=[64]*1,
activation_type='tanh',
pooling_type='max', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
bottleneck=True
)
]
for i, params in enumerate(test_params):
torch.manual_seed(seed)
net = ResNet(**params)
print(f"\n=== test {i=} ===")
print(net)
torch.manual_seed(seed)
test_out = net(torch.ones(1, 3, 100, 100))
print(f'{test_out=}')
expected_out = torch.load(f'tests/assets/expected_resnet_out_{i:02d}.pt')
print(f'max_diff={torch.max(torch.abs(test_out - expected_out)).item()}')
test.assertTrue(torch.allclose(test_out, expected_out, atol=1e-3))
=== test i=0 ===
ResNet(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.01)
(8): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.1, inplace=False)
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(2): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential()
)
)
(mlp): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=160000, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
test_out=tensor([[ 0.0422, 0.0332, 0.1870, -0.0532, -0.0742, 0.1143, -0.0617, -0.0467,
0.0852, 0.0221]], grad_fn=<AddmmBackward0>)
max_diff=0.0
=== test i=1 ===
ResNet(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(64, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(2): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential()
)
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(4): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential()
)
(5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(6): ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(64, 16, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Tanh()
(4): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Tanh()
(8): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential()
)
(7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=2304, out_features=64, bias=True)
(1): Tanh()
(2): Linear(in_features=64, out_features=5, bias=True)
(3): Identity()
)
)
)
test_out=tensor([[ 0.0237, -0.1945, -0.0085, -0.4024, -0.2667]],
grad_fn=<AddmmBackward0>)
max_diff=0.0
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Consider the bottleneck block from the right side of the ResNet diagram above. Compare it to a regular block that performs two 3x3 convs directly on the 256-channel input (i.e. as shown on the left side of the diagram, but with a different number of channels). Explain the differences between the regular block and the bottleneck block in terms of the number of parameters and the number of floating-point operations.
display_answer(hw2.answers.part2_q1)
Number of parameters:
In a regular block: $layer1 + layer2 = (3*3*256+1)*256 + (3*3*256+1)*256 = 1,180,160$.
In a bottleneck block: $(1*1*256+1)*64 + (3*3*64+1)*64 + (1*1*64+1)*256 = 70,016$.
As expected, the bottleneck block has far fewer parameters (more than an order of magnitude fewer).
Number of floating-point operations (for feature maps of spatial size $H \times W$):
In a regular block: $layer1 + ReLU + layer2 + ReLU =$ $3*3*256*H*W*256 + 256*H*W + 3*3*256*H*W*256 + 256*H*W = 1,180,160*H*W$.
In a bottleneck block: $layer1 + ReLU + layer2 + ReLU + layer3 + skip\ connection + ReLU =$ $1*1*256*H*W*64 + 64*H*W + 3*3*64*H*W*64 + 64*H*W + 1*1*64*H*W*256 + 256*H*W + 256*H*W = 70,272*H*W$.
In this part we will explore convolution networks and the effects of their architecture on accuracy. We'll use our deep CNN implementation and perform various experiments on it while varying the architecture. Then we'll implement our own custom architecture to see whether we can get high classification results on a large subset of CIFAR-10.
Training will be performed on GPU.
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
We will now perform a series of experiments that train various model configurations on a part of the CIFAR-10 dataset.
To perform the experiments, you'll need to use a machine with a GPU since training time might be too long otherwise.
Here's an example of running a forward pass on the GPU (assuming you're running this notebook on a GPU-enabled machine).
from hw2.cnn import ResNet
net = ResNet(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
pooling_type='avg', pooling_params=dict(kernel_size=2),
)
net = net.to(device)
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_image = test_image.to(device)
test_out = net(test_image)
Notice how we called .to(device) on both the model and the input tensor.
Here the device is a torch.device object that we created above. If an nvidia GPU is available on the machine you're running this on, the device will be 'cuda'. When you run .to(device) on a model, it recursively goes over all the model parameter tensors and copies their memory to the GPU. Similarly, calling .to(device) on the input image also copies it.
In order to train on a GPU, you need to make sure to move all your tensors to it. You'll get errors if you try to mix CPU and GPU tensors in a computation.
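For instance, this small sketch shows what happens when devices are mixed; the error is only observable when a GPU is actually present (on a CPU-only machine both tensors end up on 'cpu'):

```python
import torch

a = torch.ones(3)  # stays on the CPU
if torch.cuda.is_available():
    b = torch.ones(3, device='cuda')
    try:
        a + b  # mixing CPU and GPU tensors in one op
    except RuntimeError as e:
        print('mixing devices failed:', e)
else:
    print('no GPU available; both tensors are on', a.device)
```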
print(f'This notebook is running with device={device}')
print(f'The model parameter tensors are also on device={next(net.parameters()).device}')
print(f'The test image is also on device={test_image.device}')
print(f'The output is therefore also on device={test_out.device}')
This notebook is running with device=cpu
The model parameter tensors are also on device=cpu
The test image is also on device=cpu
The output is therefore also on device=cpu
First, please read the course servers guide carefully.
To run the experiments on the course servers, you can use the py-sbatch.sh script directly to perform a single experiment run in batch mode (since it runs python once), or use the srun command to do a single run in interactive mode. For example, running a single run of experiment 1 interactively (after conda activate of course):
srun -c 2 --gres=gpu:1 --pty python -m hw2.experiments run-exp -n test -K 32 64 -L 2 -P 2 -H 100
To perform multiple runs in batch mode with sbatch (e.g. for running all the configurations of an experiment), you can create your own script based on py-sbatch.sh and invoke whatever commands you need within it.
Don't request more than 2 CPU cores and 1 GPU device for your runs. The code won't be able to utilize more than that anyway, so you'll see no performance gain if you do. It will only cause delays for other students using the servers.
- After running your experiments on the course servers, copy the generated results folder to your local machine. This notebook will only display the results, not run the actual experiment code (except for a demo run).
- Each experiment run has a run_name parameter that will also be the base name of the results file which this notebook will expect to load.
- The experiments are implemented in the hw2/experiments.py module. This module has a CLI parser so that you can invoke it as a script and pass in all the configuration parameters for a single experiment run.
- Make sure to use python -m hw2.experiments run-exp to run an experiment, and not python hw2/experiments.py run-exp, regardless of how/where you run it.

In this part we will test some different architecture configurations based on our CNN and ResNet.
Specifically, we want to try different depths and number of features to see the effects these parameters have on the model's performance.
To do this, we'll define two extra hyperparameters for our model, K (filters_per_layer) and L (layers_per_block).
- K is a list containing the number of filters we want to have in our conv layers.
- L is the number of consecutive layers with the same number of filters to use.

For example, if K=[32, 64] and L=2 it means we want two conv layers with 32 filters followed by two conv layers with 64 filters. If we also use pool_every=3, the feature-extraction part of our model will be:
Conv(X,32)->ReLU->Conv(32,32)->ReLU->Conv(32,64)->ReLU->MaxPool->Conv(64,64)->ReLU
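The expansion of K and L into the full per-layer channel list can be written as a one-liner (the helper name here is illustrative, not part of the hw2 API):

```python
# Repeat each filter count in K for L consecutive layers.
def expand_channels(K, L):
    return [k for k in K for _ in range(L)]

print(expand_channels([32, 64], 2))  # [32, 32, 64, 64]
```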
We'll try various values of the K and L parameters in combination and see how each architecture trains. All other hyperparameters are up to you, including the choice of the optimization algorithm, the learning rate, regularization and architecture hyperparams such as pool_every and hidden_dims. Note that you should select the pool_every parameter wisely per experiment so that you don't end up with zero-width feature maps.
You can try some short manual runs to determine some good values for the hyperparameters or implement cross-validation to do it. However, the dataset size you test on should be large. If you limit the number of batches, make sure to use at least 30000 training images and 5000 validation images.
The important thing is that you state what you used, how you decided on it, and explain your results based on that.
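One way to sanity-check a choice of pool_every is to track the spatial size through the network. This is a rough sketch that assumes 'same'-padded convs (so only the 2x2 pools shrink the feature maps):

```python
# Spatial size after n_layers convs with a 2x2 pool every pool_every layers.
def final_spatial_size(size, n_layers, pool_every, pool_kernel=2):
    for _ in range(n_layers // pool_every):
        size //= pool_kernel
    return size

print(final_spatial_size(32, n_layers=8, pool_every=2))  # 32 -> 2 after 4 pools
```

If this returns 0 for your configuration, the feature maps collapse before reaching the MLP and a larger pool_every is needed.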
First we need to write some code to run the experiment.
TODO:
- Complete the implementation of the cnn_experiment() function in the hw2/experiments.py module, using your Trainer class.

The following block tests that your implementation works. It's also meant to show you that each experiment run creates a result file containing the parameters to reproduce it and the FitResult object for plotting.
from hw2.experiments import load_experiment, cnn_experiment
from cs236781.plot import plot_fit
"""# Test experiment1 implementation on a few data samples and with a small model
cnn_experiment(
'test_run', seed=seed, bs_train=50, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64], layers_per_block=1, pool_every=1, hidden_dims=[100],
model_type='ycn',
)
# There should now be a file 'test_run.json' in your `results/` folder.
# We can use it to load the results of the experiment.
cfg, fit_res = load_experiment('results/test_run_L1_K32-64.json')
_, _ = plot_fit(fit_res, train_test_overlay=True)
# And `cfg` contains the exact parameters to reproduce it
print('experiment config: ', cfg)"""
We'll use the following function to load multiple experiment results and plot them together.
def plot_exp_results(filename_pattern, results_dir='results'):
fig = None
result_files = glob.glob(os.path.join(results_dir, filename_pattern))
result_files.sort()
if len(result_files) == 0:
print(f'No results found for pattern {filename_pattern}.', file=sys.stderr)
return
for filepath in result_files:
m = re.match(r'exp\d_(\d_)?(.*)\.json', os.path.basename(filepath))
cfg, fit_res = load_experiment(filepath)
fig, axes = plot_fit(fit_res, fig, legend=m[2],log_loss=True)
del cfg['filters_per_layer']
del cfg['layers_per_block']
print('common config: ', cfg)
Experiment 1.1: Network depth (L)
First, we'll test the effect of the network depth on training.
Configurations:
- K=32 fixed, with L=2,4,8,16 varying per run
- K=64 fixed, with L=2,4,8,16 varying per run

So 8 different runs in total.
Naming runs:
Each run should be named exp1_1_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_1_L2_K32.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
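The eight runs above could be scripted with a simple loop. This sketch is a dry run that only prints the commands; replace echo with the real invocation (e.g. via srun or a py-sbatch.sh-based script) on the course servers, and note that -P 4 -H 100 are just example hyperparameters:

```shell
for K in 32 64; do
  for L in 2 4 8 16; do
    echo python -m hw2.experiments run-exp -n exp1_1 -K "$K" -L "$L" -P 4 -H 100
  done
done
```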
plot_exp_results('exp1_1_L*_K32*.json')
common config: {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [100], 'model_type': 'cnn', 'kw': {}}
plot_exp_results('exp1_1_L*_K64*.json')
common config: {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [100], 'model_type': 'cnn', 'kw': {}}
Experiment 1.2: Number of filters (K)
Now we'll test the effect of the number of convolutional filters in each layer.
Configurations:
- L=2 fixed, with K=[32],[64],[128],[256] varying per run.
- L=4 fixed, with K=[32],[64],[128],[256] varying per run.
- L=8 fixed, with K=[32],[64],[128],[256] varying per run.

So 12 different runs in total. To clarify, in each run K takes the value of a list with a single element.
Naming runs:
Each run should be named exp1_2_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_2_L2_K32.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_2_L2*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 1, 'hidden_dims': [100], 'model_type': 'cnn', 'kw': {}}
plot_exp_results('exp1_2_L4*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 2, 'hidden_dims': [100], 'model_type': 'cnn', 'kw': {}}
plot_exp_results('exp1_2_L8*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [100], 'model_type': 'cnn', 'kw': {}}
Experiment 1.3: Number of filters (K) and network depth (L)
Now we'll test the effect of varying both the number of convolutional filters and the network depth together.
Configurations:
- K=[64, 128, 256] fixed, with L=1,2,3,4 varying per run.

So 4 different runs in total. To clarify, in each run K takes the value of a list with three elements.
Naming runs:
Each run should be named exp1_3_L{}_K{}-{}-{} where the braces are placeholders for the values. For example, the first run should be named exp1_3_L1_K64-128-256.
TODO: Run the experiment on the above configuration with the CNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_3*.json')
common config: {'run_name': 'exp1_3', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [100], 'model_type': 'cnn', 'kw': {}}
Now we'll test the effect of skip connections on the training and performance.
Configurations:
- K=[32] fixed, with L=8,16,32 varying per run.
- K=[64, 128, 256] fixed, with L=2,4,8 varying per run.

So 6 different runs in total.
Naming runs:
Each run should be named exp1_4_L{}_K{}-{}-{} where the braces are placeholders for the values.
TODO: Run the experiment on the above configuration with the ResNet model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_4_L*_K32.json')
common config: {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [100], 'model_type': 'resnet', 'kw': {}}
plot_exp_results('exp1_4_L*_K64*.json')
common config: {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 8, 'hidden_dims': [100], 'model_type': 'resnet', 'kw': {}}
In this part you will create your own custom network architecture based on the CNN you've implemented.
Try to overcome some of the limitations you observed in your experiment 1 results, using what you learned in the course.
You are free to add whatever you like to the model. Just make sure to keep the model's init API identical (or only add parameters to it).
TODO: Implement your custom architecture in the YourCNN class within the hw2/cnn.py module.
from hw2.cnn import YourCNN
net = YourCNN((3,100,100), 10, channels=[32]*4, pool_every=2, hidden_dims=[100]*2)
print(net)
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_out = net(test_image)
print('out =', test_out)
YourCNN(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.2, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): MaxPool2d(kernel_size=2, stride=1, padding=0, dilation=1, ceil_mode=False)
(2): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.2, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential()
)
(3): MaxPool2d(kernel_size=2, stride=1, padding=0, dilation=1, ceil_mode=False)
)
(mlp): MLP(
(fc_layers): Sequential(
(0): Linear(in_features=307328, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=100, bias=True)
(3): ReLU()
(4): Linear(in_features=100, out_features=10, bias=True)
(5): Identity()
)
)
)
out = tensor([[ 9.7312, 0.6260, -14.0102, -15.9070, 7.8336, -1.0848, -0.0262,
6.0746, 12.7328, -10.7362]], grad_fn=<AddmmBackward0>)
"""
cnn_experiment(
'exp2', seed=seed, bs_train=16, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64,128], layers_per_block=3, pool_every=1, hidden_dims=[100],
model_type='ycn',
)
cnn_experiment(
'exp2', seed=seed, bs_train=16, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64,128], layers_per_block=6, pool_every=1, hidden_dims=[100],
model_type='ycn',
)
cnn_experiment(
'exp2', seed=seed, bs_train=16, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64,128], layers_per_block=9, pool_every=1, hidden_dims=[100],
model_type='ycn',
)
cnn_experiment(
'exp2', seed=seed, bs_train=16, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64,128], layers_per_block=12, pool_every=1, hidden_dims=[100],
model_type='ycn',
)
"""
Run your custom model on at least the following:
Configurations:
- K=[32, 64, 128] fixed, with L=3,6,9,12 varying per run.

So 4 different runs in total. To clarify, in each run K takes the value of a list with three elements.
If you want, you can add some extra runs following the same pattern. Try to see how deep a model you can train.
Naming runs:
Each run should be named exp2_L{}_K{}-{}-{} where the braces are placeholders for the values. For example, the first run should be named exp2_L3_K32-64-128.
TODO: Run the experiment on the above configuration with the YourCNN model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp2_L*.json')
No results found for pattern exp2_L*.json.
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs236781.answers import display_answer
import hw2.answers
Analyze your results from experiment 1.1. In particular:
- Explain the effect of depth on the accuracy. What depth produces the best results, and why do you think that's the case?
- Were there values of L for which the network wasn't trainable? What causes this? Suggest two things which may be done to resolve it, at least partially.

display_answer(hw2.answers.part3_q1)
Your answer:
From L=2 to L=16, increasing the depth first improves and then damages the accuracy. Beyond some threshold, the network becomes too deep and depth starts to hurt the training process. The best depth is L=4, which achieves the best test accuracy; this can be explained by the network being deep enough to learn more complex features while still remaining trainable.
For L=8 and L=16 the learning process isn't effective at all and the model learns nothing. We suspect the reasons are vanishing gradients and too many pooling layers, which shrink the feature maps until no information remains. Possible partial solutions include padding the input to increase the spatial dimensions of the feature maps, and adding skip connections, as done in residual blocks.
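A toy illustration of why skip connections help with gradient flow (small linear layers with tanh, unrelated to the hw2 models): the gradient reaching the input through a deep plain stack is typically far smaller than through the same stack with identity skips.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Linear(8, 8) for _ in range(30)]  # a 30-layer toy stack

def input_grad_norm(use_skip):
    # Measure the gradient norm at the input after a forward/backward pass.
    x = torch.randn(1, 8, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = out + h if use_skip else out  # optional identity skip
    h.sum().backward()
    return x.grad.norm().item()

print(f'plain: {input_grad_norm(False):.2e}')  # tiny: gradients vanish
print(f'skip:  {input_grad_norm(True):.2e}')   # much larger
```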
Analyze your results from experiment 1.2. In particular, compare to the results of experiment 1.1.
display_answer(hw2.answers.part3_q2)
Your answer:
For L=8, we see again that the training/learning process is damaged because the network is too deep, no matter what K is. For L=2, the training results with K=64 and K=128 are better, while K=256 performs worse. This could indicate over-fitting caused by the overly complex features learned with a high number of channels (many filters). Moreover, it suggests that a deep network (large L) combined with many filters yields an overly complex model which, without proper regularization, tends to over-fit the training set. So there is a trade-off between the depth (L) and the number of filters (K); in our case the best result corresponds to L=4 and K=128.
Analyze your results from experiment 1.3.
display_answer(hw2.answers.part3_q3)
Your answer:
The network with L=1 has the best performance. For L=2, L=3 and L=4, the networks cannot learn anything, because the architecture is too deep and lacks a proper trade-off between depth and filter number.
Analyze your results from experiment 1.4. Compare to experiment 1.1 and 1.3.
display_answer(hw2.answers.part3_q4)
Your answer:
In this experiment the ResNet architecture is applied. Its most notable property is that it overcomes the depth limitation and can achieve very good accuracy even with very deep architectures. The residual blocks carry the identity signal along the network via skip connections, which helps the output and the gradients propagate through deep networks without vanishing. The best performance goes to (L=8, K=32) and (L=2, K=64-128-256), which have an appropriate balance of depth and filter number. In addition, compared with the previous experiments, the ResNet significantly reduces the over-fitting problem and retains its ability to learn even when the network is really deep, e.g. L=8 with K=[64, 128, 256].
Explain the modifications you made to the architecture in your YourCNN class.
display_answer(hw2.answers.part3_q5)
(1) We wanted to mitigate the issue of vanishing gradients, so we modified the architecture as follows: we added residual blocks to (approximately) mimic the ResNet behavior. We also added pooling, dropout and batch normalization to make the model easier to train at high depths.